6 research outputs found

    Interpretable Machine Learning Methods for Prediction and Analysis of Genome Regulation in 3D

    Get PDF
    With the development of chromosome conformation capture-based techniques, we now know that chromatin is packed in three-dimensional (3D) space inside the cell nucleus. Changes in the 3D chromatin architecture have already been implicated in diseases such as cancer. Thus, a better understanding of this 3D conformation is of interest to help enhance our comprehension of the complex, multipronged regulatory mechanisms of the genome. The work described in this dissertation largely focuses on development and application of interpretable machine learning methods for prediction and analysis of long-range genomic interactions output from chromatin interaction experiments. In the first part, we demonstrate that the genetic sequence information at the ge- nomic loci is predictive of the long-range interactions of a particular locus of interest (LoI). For example, the genetic sequence information at and around enhancers can help predict whether it interacts with a promoter region of interest. This is achieved by building string kernel-based support vector classifiers together with two novel, in- tuitive visualization methods. These models suggest a potential general role of short tandem repeat motifs in the 3D genome organization. But, the insights gained out of these models are still coarse-grained. To this end, we devised a machine learning method, called CoMIK for Conformal Multi-Instance Kernels, capable of providing more fine-grained insights. When comparing sequences of variable length in the su- pervised learning setting, CoMIK can not only identify the features important for classification but also locate them within the sequence. Such precise identification of important segments of the whole sequence can help in gaining de novo insights into any role played by the intervening chromatin towards long-range interactions. Although CoMIK primarily uses only genetic sequence information, it can also si- multaneously utilize other information modalities such as the numerous functional genomics data if available. The second part describes our pipeline, pHDee, for easy manipulation of large amounts of 3D genomics data. We used the pipeline for analyzing HiChIP experimen- tal data for studying the 3D architectural changes in Ewing sarcoma (EWS) which is a rare cancer affecting adolescents. In particular, HiChIP data for two experimen- tal conditions, doxycycline-treated and untreated, and for primary tumor samples is analyzed. We demonstrate that pHDee facilitates processing and easy integration of large amounts of 3D genomics data analysis together with other data-intensive bioinformatics analyses.Mit der Entwicklung von Techniken zur Bestimmung der Chromosomen-Konforma- tion wissen wir jetzt, dass Chromatin in einer dreidimensionalen (3D) Struktur in- nerhalb des Zellkerns gepackt ist. Änderungen in der 3D-Chromatin-Architektur sind bereits mit Krankheiten wie Krebs in Verbindung gebracht worden. Daher ist ein besseres VerstĂ€ndnis dieser 3D-Konformation von Interesse, um einen tieferen Einblick in die komplexen, vielschichtigen Regulationsmechanismen des Genoms zu ermöglichen. Die in dieser Dissertation beschriebene Arbeit konzentriert sich im Wesentlichen auf die Entwicklung und Anwendung interpretierbarer maschineller Lernmethoden zur Vorhersage und Analyse von weitreichenden genomischen Inter- aktionen aus Chromatin-Interaktionsexperimenten. Im ersten Teil zeigen wir, dass die genetische Sequenzinformation an den genomis- chen Loci prĂ€diktiv fĂŒr die weitreichenden Interaktionen eines bestimmten Locus von Interesse (LoI) ist. Zum Beispiel kann die genetische Sequenzinformation an und um Enhancer-Elemente helfen, vorherzusagen, ob diese mit einer Promotorregion von Interesse interagieren. Dies wird durch die Erstellung von String-Kernel-basierten Support Vector Klassifikationsmodellen zusammen mit zwei neuen, intuitiven Visual- isierungsmethoden erreicht. Diese Modelle deuten auf eine mögliche allgemeine Rolle von kurzen, repetitiven Sequenzmotiven (”tandem repeats”) in der dreidimensionalen Genomorganisation hin. Die Erkenntnisse aus diesen Modellen sind jedoch immer noch grobkörnig. Zu diesem Zweck haben wir die maschinelle Lernmethode CoMIK (fĂŒr Conformal Multi-Instance-Kernel) entwickelt, welche feiner aufgelöste Erkennt- nisse liefern kann. Beim Vergleich von Sequenzen mit variabler LĂ€nge in ĂŒberwachten Lernszenarien kann CoMIK nicht nur die fĂŒr die Klassifizierung wichtigen Merkmale identifizieren, sondern sie auch innerhalb der Sequenz lokalisieren. Diese genaue Identifizierung wichtiger Abschnitte der gesamten Sequenz kann dazu beitragen, de novo Einblick in jede Rolle zu gewinnen, die das dazwischen liegende Chromatin fĂŒr weitreichende Interaktionen spielt. Obwohl CoMIK hauptsĂ€chlich nur genetische Se- quenzinformationen verwendet, kann es gleichzeitig auch andere Informationsquellen nutzen, beispielsweise zahlreiche funktionellen Genomdaten sofern verfĂŒgbar. Der zweite Teil beschreibt unsere Pipeline pHDee fĂŒr die einfache Bearbeitung großer Mengen von 3D-Genomdaten. Wir haben die Pipeline zur Analyse von HiChIP- Experimenten zur Untersuchung von dreidimensionalen ArchitekturĂ€nderungen bei der seltenen Krebsart Ewing-Sarkom (EWS) verwendet, welche Jugendliche betrifft. Insbesondere werden HiChIP-Daten fĂŒr zwei experimentelle Bedingungen, Doxycyclin- behandelt und unbehandelt, und fĂŒr primĂ€re Tumorproben analysiert. Wir zeigen, dass pHDee die Verarbeitung und einfache Integration großer Mengen der 3D-Genomik- Datenanalyse zusammen mit anderen datenintensiven Bioinformatik-Analysen erle- ichtert

    All Fingers Are Not the Same: Handling Variable-Length Sequences in a Discriminative Setting Using Conformal Multi-Instance Kernels

    Get PDF
    Most string kernels for comparison of genomic sequences are generally tied to using (absolute) positional information of the features in the individual sequences. This poses limitations when comparing variable-length sequences using such string kernels. For example, profiling chromatin interactions by 3C-based experiments results in variable-length genomic sequences (restriction fragments). Here, exact position-wise occurrence of signals in sequences may not be as important as in the scenario of analysis of the promoter sequences, that typically have a transcription start site as reference. Existing position-aware string kernels have been shown to be useful for the latter scenario. In this work, we propose a novel approach for sequence comparison that enables larger positional freedom than most of the existing approaches, can identify a possibly dispersed set of features in comparing variable-length sequences, and can handle both the aforementioned scenarios. Our approach, emph{CoMIK}, identifies not just the features useful towards classification but also their locations in the variable-length sequences, as evidenced by the results of three binary classification experiments, aided by recently introduced visualization techniques. Furthermore, we show that we are able to efficiently retrieve and interpret the weight vector for the complex setting of multiple multi-instance kernels

    Genetic sequence-based prediction of long-range chromatin interactions suggests a potential role of short tandem repeat sequences in genome organization

    No full text
    Abstract Background Knowing the three-dimensional (3D) structure of the chromatin is important for obtaining a complete picture of the regulatory landscape. Changes in the 3D structure have been implicated in diseases. While there exist approaches that attempt to predict the long-range chromatin interactions, they focus only on interactions between specific genomic regions — the promoters and enhancers, neglecting other possibilities, for instance, the so-called structural interactions involving intervening chromatin. Results We present a method that can be trained on 5C data using the genetic sequence of the candidate loci to predict potential genome-wide interaction partners of a particular locus of interest. We have built locus-specific support vector machine (SVM)-based predictors using the oligomer distance histograms (ODH) representation. The method shows good performance with a mean test AUC (area under the receiver operating characteristic (ROC) curve) of 0.7 or higher for various regions across cell lines GM12878, K562 and HeLa-S3. In cases where any locus did not have sufficient candidate interaction partners for model training, we employed multitask learning to share knowledge between models of different loci. In this scenario, across the three cell lines, the method attained an average performance increase of 0.09 in the AUC. Performance evaluation of the models trained on 5C data regarding prediction on an independent high-resolution Hi-C dataset (which is a rather hard problem) shows 0.56 AUC, on average. Additionally, we have developed new, intuitive visualization methods that enable interpretation of sequence signals that contributed towards prediction of locus-specific interaction partners. The analysis of these sequence signals suggests a potential general role of short tandem repeat sequences in genome organization. Conclusions We demonstrated how our approach can 1) provide insights into sequence features of locus-specific interaction partners, and 2) also identify their cell-line specificity. That our models deem short tandem repeat sequences as discriminative for prediction of potential interaction partners, suggests that they could play a larger role in genome organization. Thus, our approach can (a) be beneficial to broadly understand, at the sequence-level, chromatin interactions and higher-order structures like (meta-) topologically associating domains (TADs); (b) study regions omitted from existing prediction approaches using various information sources (e.g., epigenetic information); and (c) improve methods that predict the 3D structure of the chromatin

    Additional file 1 of Genetic sequence-based prediction of long-range chromatin interactions suggests a potential role of short tandem repeat sequences in genome organization

    No full text
    This file provides additional performance plots and visualizations, and more detailed description of the data. Figure S1: Z-scores for various cell lines at different FDRs Figure S2: Lengths of restriction fragments for various regions in different cell lines Figure S3: Box-plots of SVC performances for all regions (numbered ‘A0-A9’) in GM12878 Figure S4: Box-plots of SVC performances for all regions (numbered ‘B0-B9’) in K562 Figure S5: Box-plots of SVC performances for all regions (numbered ‘C0-C9’) in HeLa-S3 Figure S6: ‘AMPD’ visualization of the informative K-mer pairs from the classifier for region 9 in GM12878 Figure S7: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 7 and 9 in GM12878 Figure S8: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 7 and 9 in GM12878 Figure S9: ‘AMPD’ visualization of the informative K-mer pairs from the classifier for region 7 in K562 Figure S10: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 7 in K562 Figure S11: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 7 in K562 Figure S12: ‘AMPD’ visualization of the informative K-mer pairs from the classifier for region 6 in HeLa Figure S13: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 6 in HeLa Figure S14: ‘Top25’ visualization of the informative 3-mer pairs separated by various distances and their magnitudes from the classifier for region 6 in HeLa Table S1: Details of the genomic regions from each cell line Table S2: Overlap of candidate loci among regions for cell line GM12878 Table S3: Overlap of candidate loci among regions for cell line K562 Table S4: Overlap of candidate loci among regions for cell line HeLa. (PDF 1086 kb
    corecore